Seal of New York

New York City Shooting Incident Data Analysis

Author: Willian Pina

About the DataSet and Project

This dataset encompasses a comprehensive collection of all shooting incidents that have occurred in New York City from 2006 to the end of the last calendar year. It is updated quarterly and reviewed by the NYPD’s Office of Management Analysis and Planning before being made available to the public. Each record includes details about the incident such as the date, time, location, and demographic information about suspects and victims. This dataset serves as a valuable tool for analyzing the nature of criminal and shooting activity in NYC.

This project is part of the Master’s program in Data Science at the University of Colorado Boulder, taught by Professor Dr. Jane Wall, within the course “Data Science as a Field”.

Dataset Description

Column                   Description
INCIDENT_KEY             Randomly generated persistent ID for each incident
OCCUR_DATE               Exact date of the shooting incident
OCCUR_TIME               Exact time of the shooting incident
BORO                     Borough where the incident occurred
LOC_OF_OCCUR_DESC        Whether the incident occurred inside or outside (e.g. INSIDE, OUTSIDE)
PRECINCT                 Precinct where the incident occurred
JURISDICTION_CODE        Jurisdiction code where the incident occurred
LOC_CLASSFCTN_DESC       Description of the location classification (e.g. COMMERCIAL, STREET)
LOCATION_DESC            Description of the incident location (e.g. VIDEO STORE)
STATISTICAL_MURDER_FLAG  Indicates whether the shooting resulted in a victim’s death, counted as a murder
PERP_AGE_GROUP           Age category of the perpetrator
PERP_SEX                 Sex of the perpetrator
PERP_RACE                Race of the perpetrator
VIC_AGE_GROUP            Age category of the victim
VIC_SEX                  Sex of the victim
VIC_RACE                 Race of the victim
X_COORD_CD               Midblock X-coordinate for the New York State Plane Coordinate System
Y_COORD_CD               Midblock Y-coordinate for the New York State Plane Coordinate System
Latitude                 Latitude coordinate for the global coordinate system
Longitude                Longitude coordinate for the global coordinate system
Lon_Lat                  Longitude and latitude coordinates for mapping

For more information and access to the data, visit the dataset link: NYPD Shooting Incident Data (Historic).

Import data

We begin the analysis by importing the data directly from the URL provided by the dataset source.

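library(tidyverse)  # dplyr, tidyr, ggplot2
library(lubridate)  # mdy(), hms(), floor_date(), hour()
library(ggmap)      # get_stadiamap(), ggmap()
library(plotly)     # plot_ly()

URL = "https://data.cityofnewyork.us/api/views/833y-fsy8/rows.csv?accessType=DOWNLOAD"
data = read.csv(URL)
head(data)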
##   INCIDENT_KEY OCCUR_DATE OCCUR_TIME      BORO LOC_OF_OCCUR_DESC PRECINCT
## 1    244608249 05/05/2022   00:10:00 MANHATTAN            INSIDE       14
## 2    247542571 07/04/2022   22:20:00     BRONX           OUTSIDE       48
## 3     84967535 05/27/2012   19:35:00    QUEENS                        103
## 4    202853370 09/24/2019   21:00:00     BRONX                         42
## 5     27078636 02/25/2007   21:00:00  BROOKLYN                         83
## 6    230311078 07/01/2021   23:07:00 MANHATTAN                         23
##   JURISDICTION_CODE LOC_CLASSFCTN_DESC             LOCATION_DESC
## 1                 0         COMMERCIAL               VIDEO STORE
## 2                 0             STREET                    (null)
## 3                 0                                             
## 4                 0                                             
## 5                 0                                             
## 6                 2                    MULTI DWELL - PUBLIC HOUS
##   STATISTICAL_MURDER_FLAG PERP_AGE_GROUP PERP_SEX PERP_RACE VIC_AGE_GROUP
## 1                    true          25-44        M     BLACK         25-44
## 2                    true         (null)   (null)    (null)         18-24
## 3                   false                                           18-24
## 4                   false          25-44        M   UNKNOWN         25-44
## 5                   false          25-44        M     BLACK         25-44
## 6                   false                                           25-44
##   VIC_SEX VIC_RACE X_COORD_CD Y_COORD_CD Latitude Longitude
## 1       M    BLACK     986050   214231.0 40.75469 -73.99350
## 2       M    BLACK    1016802   250581.0 40.85440 -73.88233
## 3       M    BLACK    1048632   198262.0 40.71063 -73.76777
## 4       M    BLACK    1014493   242565.0 40.83242 -73.89071
## 5       M    BLACK    1009149   190104.7 40.68844 -73.91022
## 6       M    BLACK     999061   229912.0 40.79773 -73.94651
##                                         Lon_Lat
## 1                    POINT (-73.9935 40.754692)
## 2                   POINT (-73.88233 40.854402)
## 3  POINT (-73.76777349199995 40.71063412500007)
## 4 POINT (-73.89071440599997 40.832416753000075)
## 5  POINT (-73.91021857399994 40.68844345900004)
## 6  POINT (-73.94650786199998 40.79772716600007)

Clean and tidy data

Next, we inspect and prepare the data from the NYPD Shooting Incident Dataset in the following steps:

  1. Load the data
  2. Perform initial cleaning, such as converting data types and removing unnecessary columns
  3. Check for missing data

Based on this initial analysis, we can decide how to handle any missing values.

# Summary data
summary(data)
##   INCIDENT_KEY        OCCUR_DATE         OCCUR_TIME            BORO          
##  Min.   :  9953245   Length:28562       Length:28562       Length:28562      
##  1st Qu.: 65439914   Class :character   Class :character   Class :character  
##  Median : 92711254   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :127405824                                                           
##  3rd Qu.:203131993                                                           
##  Max.   :279758069                                                           
##                                                                              
##  LOC_OF_OCCUR_DESC     PRECINCT     JURISDICTION_CODE LOC_CLASSFCTN_DESC
##  Length:28562       Min.   :  1.0   Min.   :0.0000    Length:28562      
##  Class :character   1st Qu.: 44.0   1st Qu.:0.0000    Class :character  
##  Mode  :character   Median : 67.0   Median :0.0000    Mode  :character  
##                     Mean   : 65.5   Mean   :0.3219                      
##                     3rd Qu.: 81.0   3rd Qu.:0.0000                      
##                     Max.   :123.0   Max.   :2.0000                      
##                                     NA's   :2                           
##  LOCATION_DESC      STATISTICAL_MURDER_FLAG PERP_AGE_GROUP    
##  Length:28562       Length:28562            Length:28562      
##  Class :character   Class :character        Class :character  
##  Mode  :character   Mode  :character        Mode  :character  
##                                                               
##                                                               
##                                                               
##                                                               
##    PERP_SEX          PERP_RACE         VIC_AGE_GROUP        VIC_SEX         
##  Length:28562       Length:28562       Length:28562       Length:28562      
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    VIC_RACE           X_COORD_CD        Y_COORD_CD        Latitude    
##  Length:28562       Min.   : 914928   Min.   :125757   Min.   :40.51  
##  Class :character   1st Qu.:1000068   1st Qu.:182912   1st Qu.:40.67  
##  Mode  :character   Median :1007772   Median :194901   Median :40.70  
##                     Mean   :1009424   Mean   :208380   Mean   :40.74  
##                     3rd Qu.:1016807   3rd Qu.:239814   3rd Qu.:40.82  
##                     Max.   :1066815   Max.   :271128   Max.   :40.91  
##                                                        NA's   :59     
##    Longitude        Lon_Lat         
##  Min.   :-74.25   Length:28562      
##  1st Qu.:-73.94   Class :character  
##  Median :-73.92   Mode  :character  
##  Mean   :-73.91                     
##  3rd Qu.:-73.88                     
##  Max.   :-73.70                     
##  NA's   :59

# Convert the date and time columns (mdy() and hms() come from lubridate)
data$OCCUR_DATE = mdy(data$OCCUR_DATE)
data$OCCUR_TIME = hms(data$OCCUR_TIME)


# Converting variables to factor and logical types.
data$BORO = as.factor(data$BORO)
data$PERP_SEX = as.factor(data$PERP_SEX)
data$PERP_RACE = as.factor(data$PERP_RACE)
data$VIC_SEX = as.factor(data$VIC_SEX)
data$VIC_RACE = as.factor(data$VIC_RACE)
data$STATISTICAL_MURDER_FLAG = as.logical(data$STATISTICAL_MURDER_FLAG)

# Drop rows with missing values, remove unneeded columns, and exclude the invalid victim age group "1022"
data_clean <- data %>%
  filter(complete.cases(data)) %>%
  select(-c(X_COORD_CD, Y_COORD_CD, Lon_Lat)) %>%
  filter(VIC_AGE_GROUP != "1022", !is.na(VIC_AGE_GROUP))

As the first summary shows, Latitude and Longitude (59 rows each) and JURISDICTION_CODE (2 rows) contain a small number of missing values. Because these rows are a tiny fraction of the dataset, we remove them. We also exclude records whose VIC_AGE_GROUP is the invalid value "1022". Together, these steps reduce the dataset from 28,562 to 28,500 rows, a loss of about 0.2%.

We now re-run the summary to confirm that no NA values remain and that the X_COORD_CD, Y_COORD_CD, and Lon_Lat columns have been dropped. Note that blank strings and "(null)" placeholders, for example in PERP_SEX and PERP_RACE, are not NA and therefore still appear in the output.

# Summary data clean
summary(data_clean)
##   INCIDENT_KEY         OCCUR_DATE           OCCUR_TIME                       
##  Min.   :  9953245   Min.   :2006-01-01   Min.   :0S                         
##  1st Qu.: 65274632   1st Qu.:2009-08-31   1st Qu.:3H 30M 0S                  
##  Median : 92550364   Median :2013-09-09   Median :15H 14M 0S                 
##  Mean   :127113912   Mean   :2014-05-31   Mean   :12H 43M 53.7810526315807S  
##  3rd Qu.:202504684   3rd Qu.:2019-09-15   3rd Qu.:20H 45M 0S                 
##  Max.   :279758069   Max.   :2023-12-29   Max.   :23H 59M 0S                 
##                                                                              
##             BORO       LOC_OF_OCCUR_DESC     PRECINCT     JURISDICTION_CODE
##  BRONX        : 8363   Length:28500       Min.   :  1.0   Min.   :0.0000   
##  BROOKLYN     :11331   Class :character   1st Qu.: 44.0   1st Qu.:0.0000   
##  MANHATTAN    : 3744   Mode  :character   Median : 67.0   Median :0.0000   
##  QUEENS       : 4262                      Mean   : 65.5   Mean   :0.3225   
##  STATEN ISLAND:  800                      3rd Qu.: 81.0   3rd Qu.:0.0000   
##                                           Max.   :123.0   Max.   :2.0000   
##                                                                            
##  LOC_CLASSFCTN_DESC LOCATION_DESC      STATISTICAL_MURDER_FLAG
##  Length:28500       Length:28500       Mode :logical          
##  Class :character   Class :character   FALSE:22978            
##  Mode  :character   Mode  :character   TRUE :5522             
##                                                               
##                                                               
##                                                               
##                                                               
##  PERP_AGE_GROUP       PERP_SEX              PERP_RACE     VIC_AGE_GROUP     
##  Length:28500             : 9310   BLACK         :11879   Length:28500      
##  Class :character   (null): 1115                 : 9310   Class :character  
##  Mode  :character   F     :  443   WHITE HISPANIC: 2502   Mode  :character  
##                     M     :16133   UNKNOWN       : 1837                     
##                     U     : 1499   BLACK HISPANIC: 1388                     
##                                    (null)        : 1115                     
##                                    (Other)       :  469                     
##  VIC_SEX                             VIC_RACE        Latitude    
##  F: 2753   AMERICAN INDIAN/ALASKAN NATIVE:   11   Min.   :40.51  
##  M:25735   ASIAN / PACIFIC ISLANDER      :  440   1st Qu.:40.67  
##  U:   12   BLACK                         :20200   Median :40.70  
##            BLACK HISPANIC                : 2787   Mean   :40.74  
##            UNKNOWN                       :   70   3rd Qu.:40.82  
##            WHITE                         :  728   Max.   :40.91  
##            WHITE HISPANIC                : 4264                  
##    Longitude     
##  Min.   :-74.25  
##  1st Qu.:-73.94  
##  Median :-73.92  
##  Mean   :-73.91  
##  3rd Qu.:-73.88  
##  Max.   :-73.70  
## 
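
The summary above still lists blank strings and "(null)" as ordinary values in columns such as PERP_SEX and PERP_RACE. The following is a minimal sketch, in base R, of one way to recode such placeholders as NA so they are counted as missing. The data frame `toy` and the helper `recode_placeholders()` are hypothetical names introduced for illustration, and the values are made up rather than taken from the real dataset.

```r
# Toy data frame mimicking the placeholder values seen in the summary
# (illustrative values only, not real records).
toy <- data.frame(
  PERP_SEX  = c("M", "", "(null)", "F", "U"),
  PERP_RACE = c("BLACK", "", "(null)", "WHITE", "UNKNOWN"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn placeholder strings into NA.
recode_placeholders <- function(x, placeholders = c("", "(null)")) {
  x <- as.character(x)
  x[x %in% placeholders] <- NA_character_
  x
}

# Apply the helper to every column while keeping the data frame structure
toy[] <- lapply(toy, recode_placeholders)

# Count missing values per column after recoding
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Whether codes such as "U" or "UNKNOWN" should also be treated as missing is an analytical choice; this sketch leaves them untouched.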

Visualizations and Analysis

In this section we build visualizations to test some hypotheses and extract insights.

Our dataset contains georeferenced information about shooting incidents in New York City, including the age group of the victims. We can use this data to plot the locations of these incidents on a map and color them by the victims’ age groups. This lets us check whether certain areas of the city are more frequently associated with victims of a particular age profile, and explore the geographic distribution of incidents more generally.

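# Get a basemap (get_stadiamap() needs a Stadia Maps API key, registered beforehand with register_stadiamaps()).
# The bounding box is padded by 0.01 degrees outward on every side so that edge points are not clipped.
map_data <- get_stadiamap(
  bbox = c(left   = min(data_clean$Longitude) - 0.01,
           bottom = min(data_clean$Latitude)  - 0.01,
           right  = max(data_clean$Longitude) + 0.01,
           top    = max(data_clean$Latitude)  + 0.01),
  maptype = "stamen_toner_lite")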


# Create the chart
gg <- ggmap(map_data) +
  geom_point(data = data_clean, aes(x = Longitude, y = Latitude, color = VIC_AGE_GROUP), alpha = 0.5, size = 3) +
  scale_color_manual(values = c("18-24" = "blue", "25-44" = "red", "45-64" = "green", "65+" = "yellow", "<18" = "purple", "UNKNOWN" = "grey"),
                     name = "Age Range") +
  labs(title = "Map of Shooting Incidents by Victim Age Category",
       subtitle = "NYPD Shooting Incident Data",
       caption = "Source: NYPD Shooting Incident Data") +
  theme_minimal() +
  theme(plot.title = element_text(size = 16),
        plot.subtitle = element_text(size = 14),
        plot.caption = element_text(size = 12),
        legend.title = element_text(size = 14),
        legend.text = element_text(size = 12),
        axis.title = element_blank(), 
        axis.text = element_blank(),  
        axis.ticks = element_blank())

# Show the graph
print(gg)
## Warning: Removed 4 rows containing missing values or values outside the scale range
## (`geom_point()`).
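
A warning like the one above typically means that some points fall outside the extent of the basemap. One way to guard against this is to pad the bounding box outward on all four sides. The sketch below shows the idea in base R; `padded_bbox()` is a hypothetical helper (not part of ggmap), and the coordinates are illustrative values chosen to roughly span the ranges reported in the summary.

```r
# Hypothetical helper (not part of ggmap): build a bounding box that
# extends `pad` degrees beyond the data on every side.
padded_bbox <- function(lon, lat, pad = 0.01) {
  c(
    left   = min(lon, na.rm = TRUE) - pad,
    bottom = min(lat, na.rm = TRUE) - pad,
    right  = max(lon, na.rm = TRUE) + pad,
    top    = max(lat, na.rm = TRUE) + pad
  )
}

# Illustrative coordinates (not real incident locations)
lon <- c(-74.25, -73.92, -73.70)
lat <- c(40.51, 40.70, 40.91)

bbox <- padded_bbox(lon, lat)
print(bbox)

# Check that every point lies strictly inside the padded box
inside <- lon > bbox[["left"]] & lon < bbox[["right"]] &
  lat > bbox[["bottom"]] & lat < bbox[["top"]]
print(all(inside))
```

The resulting named vector uses the left/bottom/right/top names that the `bbox` argument in the map call above expects.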

The map shows no area of the city that is visibly dominated by a particular victim age group. Across the whole map, however, victims aged 25 to 44 (shown in red) are the most common.

Staten Island, the borough in the southwest of the map, shows comparatively few incidents (800 of the 28,500 records in the cleaned data). It is the least populous of the five boroughs, which may partly explain the lower number of reported incidents there.

To confirm the initial observation that the majority of the victims belong to the 25 to 44 age group, we next look at the counts per age group.

# Group data by victim's age category and count events
age_data <- data_clean %>%
  group_by(VIC_AGE_GROUP) %>%
  summarise(Count = n(), .groups = 'drop')

# Create the bar chart
gg <- ggplot(age_data, aes(x = VIC_AGE_GROUP, y = Count, fill = VIC_AGE_GROUP)) +
  geom_bar(stat = "identity", color = "black") +
  labs(title = "Number of Shooting Incidents by Victim Age Category",
       x = "Victim Age Category",
       y = "Number of Incidents",
       fill = "Age Category") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))   

# Plot the graph
print(gg)

The analysis of the data reveals that individuals aged 25 to 44 are the most affected, followed by young people between 18 and 24 years old.
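
Raw counts are easier to compare when they are also expressed as shares of the total. The snippet below is a minimal base R sketch of that calculation. The vector `vic_age` holds made-up values for illustration only, so the printed percentages say nothing about the real dataset.

```r
# Illustrative victim age groups (toy values, not the real counts)
vic_age <- c("25-44", "18-24", "25-44", "<18", "25-44",
             "18-24", "45-64", "25-44", "18-24", "65+")

# Count each group and convert the counts to percentages of the total
counts <- table(vic_age)
shares <- round(100 * prop.table(counts), 1)

age_summary <- data.frame(
  age_group = names(counts),
  count     = as.integer(counts),
  share_pct = as.numeric(shares),
  stringsAsFactors = FALSE
)

# Sort from the most to the least frequent group
age_summary <- age_summary[order(-age_summary$count), ]
print(age_summary)
```

Applied to the real data, the same steps could take `data_clean$VIC_AGE_GROUP` in place of the toy vector.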

To examine whether the number of incidents differs by the perpetrator's sex over time, we can build a line graph of incidents aggregated by month.

This graph will allow us to observe trends and discrepancies between male and female categories over the months. Visualizing these trends can provide valuable insights into the patterns of criminal behavior associated with each gender.

# Keep only cleaned records where the perpetrator's sex is known.
# A separate data frame is used so that data_clean stays intact for later steps.
perp_known <- data_clean %>%
  filter(PERP_SEX %in% c("M", "F"))

# Reorder the factor levels so that M appears above F in the legend
perp_known$PERP_SEX <- factor(perp_known$PERP_SEX, levels = c("M", "F"))

# Group data by month and perpetrator's sex
timeline_data <- perp_known %>%
  group_by(Month = floor_date(OCCUR_DATE, "month"), PERP_SEX) %>%
  summarise(Count = n(), .groups = 'drop')

# Create the line chart
gg <- ggplot(timeline_data, aes(x = Month, y = Count, color = PERP_SEX, group = PERP_SEX)) +
  geom_line() +
  labs(title = "Monthly Timeline of Shooting Incidents by Perpetrator's Sex",
       x = "Date",
       y = "Number of Incidents",
       color = "Perpetrator's Sex") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Show the graph
print(gg)

Over the period covered by the data, several points stand out:

  1. Males predominate as perpetrators of this type of crime.
  2. Recurring peaks in the series may indicate seasonality in the incidents.
  3. Between the years 2015 and 2020, there was a decrease in the number of incidents, followed by a phase of stability.
  4. Since 2006, the first year in the dataset, the overall trend in incidents has been downward.
  5. Incidents attributed to females have remained at a stable level over the years, with slight variations.
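
The seasonality suggested in point 2 can be checked by pooling incidents by calendar month across all years. The following is a minimal base R sketch of that aggregation. The dates in `occur_date` are made up for illustration, so the resulting peak is not a finding about the real data.

```r
# Illustrative incident dates (toy values, not the real data)
occur_date <- as.Date(c("2019-01-15", "2019-07-04", "2019-07-20",
                        "2020-02-10", "2020-07-11", "2020-08-02",
                        "2021-07-30", "2021-08-14", "2021-12-25"))

# Month of the year as an integer from 1 to 12
month_num <- as.integer(format(occur_date, "%m"))

# A factor with all 12 levels keeps months that have no incidents
month_of_year <- factor(month_num, levels = 1:12, labels = month.abb)

# Incidents per calendar month, pooled over all years
monthly_counts <- table(month_of_year)
print(monthly_counts)

# Calendar month with the most incidents in this toy sample
peak_month <- names(monthly_counts)[which.max(monthly_counts)]
print(peak_month)
```

If the same calendar months stand out year after year in the real data, that would support a seasonal pattern; a single pooled peak alone would not establish it.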

Our dataset includes the OCCUR_TIME column, which records the time of each incident. Using this information, we can identify the times of day at which incidents are most frequent in New York City.

This analysis will allow us to better understand the temporal patterns of the incidents and potentially direct prevention and response measures more effectively.

# OCCUR_TIME was already parsed with hms() during cleaning; extract the hour of day (0-23)
data_clean$Hour <- hour(data_clean$OCCUR_TIME)

# Ensure that all times are represented even if there are no incidents
full_hours <- data.frame(Hour = 0:23)
hourly_data <- full_hours %>%
  left_join(data_clean %>% group_by(Hour) %>% summarise(Incidents = n(), .groups = 'drop'), by = "Hour") %>%
  replace_na(list(Incidents = 0))

# Generate labels for hours
hourly_data$HourLabel <- sprintf("%02d:00", hourly_data$Hour)

# Create the radial graph
fig <- plot_ly(
  data = hourly_data,
  type = 'scatterpolar',
  mode = 'lines+markers',
  r = hourly_data$Incidents,
  theta = hourly_data$HourLabel,
  fill = 'toself',
  line = list(color = 'blue')
) %>%
  layout(
    polar = list(
      radialaxis = list(
        visible = TRUE,
        range = c(0, max(hourly_data$Incidents) + 10)
      ),
      angularaxis = list(
        direction = "clockwise",  # Set to clockwise
        rotation = 90,  
        type = 'category',
        showline = FALSE,
        tickmode = 'array',
        tickvals = hourly_data$Hour,
        ticktext = hourly_data$HourLabel
      )
    ),
    title = "Number of Incidents during the Day",
    margin = list(t = 100)  
  )

# Show the graph
fig

The radial graph of incidents by hour shows that the period between 21:00 and 23:00 has the highest number of incidents. Conversely, the hours between 05:00 and 13:00 show a marked reduction in the number of events.

Incidents begin to escalate at around 18:00. This may be related to people’s behavior after the end of the workday, when they are either returning home or going out for evening activities and are therefore more exposed to potential incidents.
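
To put a number on an observation such as the late-evening peak, one can compute the share of incidents that fall within a given hour window. The snippet below is a minimal base R sketch of that calculation. The times in `occur_time` are illustrative strings in the dataset's HH:MM:SS format, not a sample drawn from the real data.

```r
# Illustrative OCCUR_TIME strings in HH:MM:SS format (toy values)
occur_time <- c("00:10:00", "22:20:00", "19:35:00", "21:00:00",
                "21:00:00", "23:07:00", "09:15:00", "14:40:00")

# Hour of day (0-23) parsed from the first two characters
hour_of_day <- as.integer(substr(occur_time, 1, 2))

# Counts for all 24 hours, including hours with no incidents
hourly_counts <- tabulate(hour_of_day + 1, nbins = 24)
names(hourly_counts) <- sprintf("%02d:00", 0:23)
print(hourly_counts)

# Share of incidents that start between 21:00 and 23:59
late_share <- mean(hour_of_day >= 21 & hour_of_day <= 23)
print(late_share)
```

The window boundaries are an assumption of this sketch; shifting them (for example to 18:00 through 23:59) only requires changing the two comparison values.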

Conclusions

Having explored the dataset on shooting incidents in New York City, we see it as a valuable resource that can help the government shape security policies. It includes variables such as race, sex, and location (borough and precinct), which could be analyzed further to understand the dynamics of security across different parts of the city. A pertinent question would be whether more affluent neighborhoods record shootings at the same rate as less privileged areas. The answer could inform targeted policies to address any imbalance.

Moreover, analyzing gender and race in the incidents could open a dialogue about potential biases in these categories, but it is crucial to maintain a clear focus to avoid deviations from the initial objective of the analysis. Variables of social class add another layer of complexity and should be approached with a defined purpose to prevent divergent debates.

One notable finding is the predominance of shootings at night. One hypothesis is that this reflects a reduced police patrol presence during those hours. However, the dataset contains no information on patrols, so it cannot test this, and given that the New York Police Department is known to be well-equipped and trained, such a factor may be less influential than initially presumed.

It is also important to highlight that the dataset is reviewed by the Office of Management Analysis and Planning, which can influence how data is presented. This review can either intensify or soften certain aspects of the data, potentially creating an analytical bias that favors interpretations aligned with political interests, especially in a context where one political party has dominated for years.

These considerations underline the need for careful and objective analysis, always seeking clarity in objectives so that conclusions are based on robust evidence and not on premises influenced by potential political or social biases.